@aseembits93
Contributor

📄 30% (0.30x) speedup for merge_out_layout_with_ocr_layout in unstructured/partition/pdf_image/ocr.py

⏱️ Runtime : 329 milliseconds → 252 milliseconds (best of 5 runs)
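
The headline percentage and the per-test percentages in this report appear consistent with measuring speedup relative to the optimized runtime, i.e. (original - optimized) / optimized. That formula is inferred from the numbers and is not stated in the report, and the `speedup` helper below is purely illustrative:

```python
def speedup(original: float, optimized: float) -> float:
    """Relative speedup, assuming the (original - optimized) / optimized convention."""
    return (original - optimized) / optimized


# Headline runtimes from the report, in milliseconds.
headline = speedup(329, 252)

# One of the per-test timings quoted in the report (43.8ms -> 31.0ms).
large_merge = speedup(43.8, 31.0)

print(f"headline: {headline:.1%}, large merge test: {large_merge:.1%}")
```

Because the displayed runtimes are rounded, recomputed values can differ from the reported ones in the last digit.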

📝 Explanation and details

The optimized code achieves a 30% speedup through three optimizations in aggregate_embedded_text_by_block and supplement_layout_with_ocr_elements:

Key Optimizations

1. Replaced .sum(axis=1).astype(bool) with .any(axis=1)

This change appears in both functions, where a boolean mask is computed from the result of bboxes1_is_almost_subregion_of_bboxes2().

Why it's faster:

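  • .sum(axis=1) creates an intermediate integer array by counting True values across columns, which .astype(bool) then converts back to boolean in a second pass
  • .any(axis=1) reduces each row with a logical OR and returns a boolean array directly, so it can stop at the first True value instead of completing a full summation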
  • Eliminates the explicit .astype(bool) conversion overhead

Performance impact: Based on line profiler, the mask computation in aggregate_embedded_text_by_block dropped from ~234ms to ~222ms (5% faster), and the overall function improved from 551ms to 443ms (19.6% faster).
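
A minimal sketch of why the two mask expressions are interchangeable. The random matrix is a hypothetical stand-in for the output of bboxes1_is_almost_subregion_of_bboxes2(); its real shape and density may differ:

```python
import numpy as np

# Hypothetical stand-in for the 2-D boolean matrix of pairwise subregion tests.
rng = np.random.default_rng(0)
is_subregion = rng.random((1000, 50)) < 0.02

# Original form: count True values per row, then cast the counts to bool.
mask_via_sum = is_subregion.sum(axis=1).astype(bool)

# Optimized form: logical-OR reduction per row, already boolean.
mask_via_any = is_subregion.any(axis=1)

# Both expressions select exactly the same rows.
assert mask_via_any.dtype == np.bool_
assert np.array_equal(mask_via_sum, mask_via_any)
```

The equivalence holds for any boolean input, since a row's count is nonzero exactly when at least one entry is True.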

2. Avoided redundant slicing operations

In aggregate_embedded_text_by_block, the optimized code calls source_regions.slice(mask) once, stores the result in sliced, and reuses it, instead of calling source_regions.slice(mask) three separate times.

Why it's faster:

  • Each slice() operation creates a new object with coordinate and text array copies
  • Line profiler shows the original made 3 separate slice calls (48ms + 25ms + 34ms = 107ms total)
  • The optimized version makes 1 slice call (~28ms), saving ~79ms over the profiled run
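
The slice-once pattern can be sketched as follows. FakeRegions is a hypothetical stand-in for TextRegions that only mimics a copying slice(), and the three uses of the sliced data (joined text, coordinates, count) are assumptions for illustration, not the actual body of aggregate_embedded_text_by_block:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class FakeRegions:
    """Hypothetical stand-in for TextRegions; not the real class."""

    element_coords: np.ndarray
    texts: np.ndarray
    slice_calls: int = 0  # counts slice() invocations for illustration

    def slice(self, mask: np.ndarray) -> "FakeRegions":
        # Boolean indexing copies the selected rows, so each call allocates new arrays.
        self.slice_calls += 1
        return FakeRegions(self.element_coords[mask], self.texts[mask])


def summarize_repeated(regions, mask):
    # Pattern described for the original code: one slice() per use.
    text = " ".join(regions.slice(mask).texts)
    coords = regions.slice(mask).element_coords
    count = len(regions.slice(mask).texts)
    return text, coords, count


def summarize_once(regions, mask):
    # Pattern described for the optimized code: slice once, reuse the result.
    sliced = regions.slice(mask)
    return " ".join(sliced.texts), sliced.element_coords, len(sliced.texts)


regions = FakeRegions(
    element_coords=np.array([[0, 0, 1, 1], [2, 2, 3, 3], [4, 4, 5, 5]], dtype=float),
    texts=np.array(["a", "b", "c"], dtype=object),
)
mask = np.array([True, False, True])

repeated = summarize_repeated(regions, mask)
calls_repeated = regions.slice_calls
once = summarize_once(regions, mask)
calls_once = regions.slice_calls - calls_repeated
```

Both helpers return the same values; only the number of slice() calls, and therefore the number of array copies, differs.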

3. Early exit with mask.any()

The optimized code guards the processing with an if mask.any(): check, so no work is done when no regions match.

Why it's faster:

  • Skips text joining, bbox extraction, and IOU calculations when mask is empty
  • Particularly beneficial for the 368 cases (31% of calls) where no matching regions exist
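
A minimal sketch of the early-exit guard. The helper name join_matching_texts and the empty-string result for the no-match case are assumptions for illustration; the real function may return something different:

```python
import numpy as np


def join_matching_texts(texts: np.ndarray, mask: np.ndarray) -> str:
    """Hypothetical helper: join the texts selected by mask."""
    if not mask.any():
        # No region matched, so skip indexing and joining entirely.
        return ""
    return " ".join(texts[mask])


texts = np.array(["first", "second", "third"], dtype=object)

no_match = join_matching_texts(texts, np.array([False, False, False]))
some_match = join_matching_texts(texts, np.array([True, False, True]))
```

The guard costs one cheap reduction over the mask and pays off whenever a meaningful share of calls have no matching regions.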

Impact Based on Test Results

The optimization is particularly effective for workloads with:

  1. Many elements requiring text aggregation (roughly 15-42% speedup on tests with 100-500 elements)

    • test_large_scale_many_elements_aggregated: 77ms → 67.2ms (14.6% faster)
    • test_merge_large_number_of_elements: 43.8ms → 31.0ms (41.3% faster)
    • test_merge_boundary_coordinates_large_scale: 87.3ms → 61.5ms (41.8% faster)
  2. Documents with invalid text patterns (10-20% speedup)

    • test_invalid_texts_are_replaced: 250μs → 222μs (12.4% faster)
    • test_merge_with_all_invalid_text: 654μs → 554μs (18.1% faster)
  3. Complex spatial matching scenarios (33-36% speedup)

    • test_merge_with_overlapping_elements: 25.2ms → 18.9ms (33.3% faster)
    • test_merge_with_varied_subregion_thresholds: 78.6ms → 57.6ms (36.4% faster)

Context Impact

The function merge_out_layout_with_ocr_layout is called from supplement_page_layout_with_ocr in OCR processing hot paths, specifically when ocr_mode == OCRMode.FULL_PAGE. It is invoked once per processed page, so the 30% speedup translates directly into higher document processing throughput for PDF/image partitioning workflows.

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 24 Passed |
| 🌀 Generated Regression Tests | 29 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

⚙️ Existing Unit Tests

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
| --- | --- | --- | --- |
| partition/pdf_image/test_ocr.py::test_merge_out_layout_with_cid_code | 2.55ms | 2.26ms | 12.6% ✅ |
| partition/pdf_image/test_ocr.py::test_merge_out_layout_with_ocr_layout | 2.32ms | 2.05ms | 13.2% ✅ |

🌀 Generated Regression Tests

import numpy as np  # used to build arrays for coordinates and texts
from unstructured_inference.constants import IsExtracted
from unstructured_inference.inference.elements import TextRegions
from unstructured_inference.inference.layoutelement import LayoutElements

# imports
from unstructured.partition.pdf_image.ocr import merge_out_layout_with_ocr_layout


# Helper constructors: attempt several plausible constructor signatures for the domain classes.
# This keeps tests robust across minor constructor variations in the real codebase while still
# using the real classes (no stubs).
def _construct_layout_elements(element_coords, texts, sources=None):
    """
    Try to construct a LayoutElements object using a few reasonable constructor signatures.
    Raise RuntimeError with helpful debugging info if none work.
    """
    # Normalize inputs to numpy arrays where appropriate
    ec = np.asarray(element_coords, dtype=float)
    txts = np.asarray(texts, dtype=object)

    possible_args = [
        {
            "element_coords": ec,
            "texts": txts,
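            # Avoid `sources or []`: truth-testing a multi-element array raises ValueError.
            "sources": np.asarray([] if sources is None else sources, dtype=object),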
            "element_class_ids": np.zeros(txts.shape),
            "element_class_id_map": {},
        },
        (ec, txts, np.asarray([] if sources is None else sources, dtype=object)),
        (ec, txts),
    ]

    last_exc = None
    for args in possible_args:
        try:
            if isinstance(args, dict):
                return LayoutElements(**args)
            else:
                return LayoutElements(*args)
        except Exception as e:
            last_exc = e
            continue

    # If we get here, none of the constructors worked; dump useful info for debugging
    raise RuntimeError(
        "Could not instantiate LayoutElements with tried constructor signatures. "
        f"Last exception: {last_exc!r}"
    )


def _construct_text_regions(element_coords, texts, sources=None, is_extracted=None):
    """
    Try to construct a TextRegions instance with several likely constructor signatures.
    """
    ec = np.asarray(element_coords, dtype=float)
    txts = np.asarray(texts, dtype=object)
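    # Avoid `sources or []`: truth-testing a multi-element array raises ValueError.
    sources_arr = np.asarray([] if sources is None else sources, dtype=object)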
    # default is_extracted flags if not provided
    if is_extracted is None:
        is_extracted = np.array([IsExtracted.TRUE] * txts.shape[0], dtype=object)
    else:
        is_extracted = np.asarray(is_extracted, dtype=object)

    possible_args = [
        {
            "element_coords": ec,
            "texts": txts,
            "sources": sources_arr,
            "is_extracted_array": is_extracted,
        },
        (ec, txts, sources_arr, is_extracted),
        (ec, txts, sources_arr),
    ]

    last_exc = None
    for args in possible_args:
        try:
            if isinstance(args, dict):
                return TextRegions(**args)
            else:
                return TextRegions(*args)
        except Exception as e:
            last_exc = e
            continue

    raise RuntimeError(
        "Could not instantiate TextRegions with tried constructor signatures. "
        f"Last exception: {last_exc!r}"
    )


def test_returns_out_layout_when_out_or_ocr_empty():
    """
    Basic: Ensure the function returns early/unmodified when either input collection is empty.
    - Case A: out_layout is empty (should return out_layout immediately even if ocr_layout not empty).
    - Case B: ocr_layout is empty (should return out_layout immediately).
    """
    # Build an empty LayoutElements (0 boxes)
    empty_coords = np.zeros((0, 4))
    empty_texts = np.array([], dtype=object)

    out_empty = _construct_layout_elements(empty_coords, empty_texts)
    # Build a tiny OCR layout with one element to test early-return in Case A
    ocr_coords = np.array([[0.0, 0.0, 1.0, 1.0]])
    ocr_texts = np.array(["some text"], dtype=object)
    ocr_nonempty = _construct_text_regions(ocr_coords, ocr_texts)

    # Case A: out_layout empty -> should return out_layout (empty)
    codeflash_output = merge_out_layout_with_ocr_layout(out_empty, ocr_nonempty)
    result_a = codeflash_output  # 1.97μs -> 2.01μs (1.89% slower)

    # Case B: ocr_layout empty -> should return out_layout unchanged (non-empty out_layout)
    out_coords = np.array([[0.0, 0.0, 1.0, 1.0]])
    out_texts = np.array(["valid"], dtype=object)
    out_nonempty = _construct_layout_elements(out_coords, out_texts)
    ocr_empty = _construct_text_regions(empty_coords, empty_texts)

    codeflash_output = merge_out_layout_with_ocr_layout(out_nonempty, ocr_empty)
    result_b = codeflash_output  # 1.56μs -> 1.42μs (9.70% faster)


def test_valid_texts_not_modified_when_supplement_false():
    """
    Basic: If out_layout.texts are already valid, the function should not modify them.
    Use supplement_with_ocr_elements=False to avoid additional OCR supplementation behavior.
    """
    # Single element in out layout with valid ASCII text
    coords = np.array([[0.1, 0.1, 0.5, 0.5]])
    texts = np.array(["Already good text"], dtype=object)
    out_layout = _construct_layout_elements(coords, texts)

    # OCR layout with some text (should be ignored as supplement_with_ocr_elements=False)
    ocr_coords = np.array([[0.1, 0.1, 0.5, 0.5]])
    ocr_texts = np.array(["ocr text"], dtype=object)
    ocr_layout = _construct_text_regions(ocr_coords, ocr_texts)

    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 10.9μs -> 10.0μs (8.94% faster)


def test_invalid_texts_are_replaced_with_aggregated_ocr_text():
    """
    Edge: If out_layout contains invalid text (empty string or containing '(cid:'), the function
    should attempt to aggregate OCR text into that element.
    We create coordinates that exactly match so aggregation selects the OCR region.
    """
    # Create a single out element with invalid text (empty string)
    coords = np.array([[0.0, 0.0, 1.0, 1.0]])
    invalid_texts = np.array([""], dtype=object)  # empty -> invalid per valid_text
    out_layout = _construct_layout_elements(coords, invalid_texts)

    # OCR layout contains a matching box with real text we expect to be aggregated
    ocr_coords = np.array([[0.0, 0.0, 1.0, 1.0]])
    ocr_texts = np.array(["Aggregated OCR text"], dtype=object)
    # Ensure OCR regions are marked as extracted (this helps make aggregate_embedded_text_by_block
    # set IsExtracted.TRUE conditions if logic depends on it).
    ocr_is_extracted = np.array([IsExtracted.TRUE], dtype=object)
    ocr_layout = _construct_text_regions(ocr_coords, ocr_texts, is_extracted=ocr_is_extracted)

    # Perform merge and verify the empty string was replaced by OCR text
    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 250μs -> 222μs (12.4% faster)


def test_invalid_cid_texts_are_treated_as_invalid_and_replaced():
    """
    Edge: Strings containing '(cid:' should be considered invalid by valid_text and replaced
    with OCR-aggregated text.
    """
    coords = np.array([[0.2, 0.2, 0.8, 0.8]])
    # containing (cid: should be invalid
    invalid_texts = np.array(["(cid:1234)"], dtype=object)
    out_layout = _construct_layout_elements(coords, invalid_texts)

    ocr_coords = np.array([[0.2, 0.2, 0.8, 0.8]])
    ocr_texts = np.array(["Replaced text"], dtype=object)
    ocr_is_extracted = np.array([IsExtracted.TRUE], dtype=object)
    ocr_layout = _construct_text_regions(ocr_coords, ocr_texts, is_extracted=ocr_is_extracted)

    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 258μs -> 218μs (18.3% faster)


def test_large_scale_many_elements_aggregated_without_supplementing():
    """
    Large-scale: Create many (but under 1000) out elements with initially invalid text (empty),
    and corresponding OCR regions with matching boxes and unique texts. Ensure merge handles
    a moderate sized input and replaces all invalid texts.
    This verifies scalability and that the aggregation loop works across many elements.
    """
    n = 100  # keep under the 1000-element limit specified in the instructions
    # Build n identical boxes for simplicity (exact match ensures subregion tests pass)
    out_coords = np.tile(np.array([[0.0, 0.0, 1.0, 1.0]]), (n, 1))
    out_texts = np.array([""] * n, dtype=object)  # all invalid initially
    out_layout = _construct_layout_elements(out_coords, out_texts)

    # Build corresponding OCR boxes and unique OCR texts for aggregation
    ocr_coords = np.tile(np.array([[0.0, 0.0, 1.0, 1.0]]), (n, 1))
    ocr_texts = np.array([f"ocr_text_{i}" for i in range(n)], dtype=object)
    # Mark all OCR regions as extracted to make the aggregated text more likely to be considered fully_filled
    ocr_is_extracted = np.array([IsExtracted.TRUE] * n, dtype=object)
    ocr_layout = _construct_text_regions(ocr_coords, ocr_texts, is_extracted=ocr_is_extracted)

    # Run the merge without supplementing (we only want aggregation)
    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 77.0ms -> 67.2ms (14.6% faster)
    for i in range(n):
        pass


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import numpy as np
from unstructured_inference.inference.elements import TextRegions
from unstructured_inference.inference.layoutelement import LayoutElements

from unstructured.documents.elements import ElementType
from unstructured.partition.pdf_image.ocr import merge_out_layout_with_ocr_layout


def test_merge_empty_out_layout_returns_original():
    """Test that empty out_layout returns immediately without processing."""
    # Create empty out_layout and non-empty ocr_layout
    empty_out_layout = LayoutElements(
        element_coords=np.array([]).reshape(0, 4),
        texts=np.array([], dtype=object),
        sources=np.array([], dtype=object),
        element_class_ids=np.array([], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )
    ocr_layout = TextRegions(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array(["OCR text"], dtype=object),
        sources=np.array([None], dtype=object),
    )

    # Call function with supplement_with_ocr_elements=False to skip that logic
    codeflash_output = merge_out_layout_with_ocr_layout(
        empty_out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 2.33μs -> 2.30μs (1.57% faster)


def test_merge_empty_ocr_layout_returns_original():
    """Test that empty ocr_layout returns original out_layout without modification."""
    # Create non-empty out_layout and empty ocr_layout
    out_layout = LayoutElements(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array(["Layout text"], dtype=object),
        sources=np.array([None], dtype=object),
        element_class_ids=np.array([0], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )
    empty_ocr_layout = TextRegions(
        element_coords=np.array([]).reshape(0, 4),
        texts=np.array([], dtype=object),
        sources=np.array([], dtype=object),
    )

    # Call function with supplement_with_ocr_elements=False
    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, empty_ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 2.70μs -> 2.77μs (2.45% slower)


def test_merge_valid_text_skips_aggregation():
    """Test that elements with valid text are not modified by OCR aggregation."""
    # Create out_layout with valid text (no "(cid:" pattern)
    out_layout = LayoutElements(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array(["Valid text"], dtype=object),
        sources=np.array([None], dtype=object),
        element_class_ids=np.array([0], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )
    ocr_layout = TextRegions(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array(["OCR replacement"], dtype=object),
        sources=np.array([None], dtype=object),
    )

    # Call with supplement disabled
    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 11.1μs -> 10.9μs (2.30% faster)


def test_merge_invalid_text_gets_replaced():
    """Test that elements with invalid text (containing '(cid:') trigger OCR aggregation."""
    # Create out_layout with invalid text containing "(cid:"
    out_layout = LayoutElements(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array(["This (cid:1234) invalid"], dtype=object),
        sources=np.array([None], dtype=object),
        element_class_ids=np.array([0], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )
    ocr_layout = TextRegions(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array(["OCR text"], dtype=object),
        sources=np.array([None], dtype=object),
    )

    # Call with supplement disabled
    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 268μs -> 244μs (10.0% faster)


def test_merge_empty_string_is_invalid():
    """Test that empty strings are treated as invalid and trigger OCR aggregation."""
    # Create out_layout with empty text
    out_layout = LayoutElements(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array([""], dtype=object),
        sources=np.array([None], dtype=object),
        element_class_ids=np.array([0], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )
    ocr_layout = TextRegions(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array(["OCR provides text"], dtype=object),
        sources=np.array([None], dtype=object),
    )

    # Call with supplement disabled
    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 265μs -> 220μs (20.4% faster)


def test_merge_multiple_elements_mixed_validity():
    """Test merging with multiple elements having mixed valid/invalid text."""
    # Create out_layout with multiple elements
    out_layout = LayoutElements(
        element_coords=np.array(
            [
                [10, 20, 30, 40],
                [50, 60, 70, 80],
                [90, 100, 110, 120],
            ]
        ),
        texts=np.array(["Valid text", "Invalid (cid:999)", "Another valid"], dtype=object),
        sources=np.array([None, None, None], dtype=object),
        element_class_ids=np.array([0, 0, 0], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )
    ocr_layout = TextRegions(
        element_coords=np.array(
            [
                [50, 60, 70, 80],  # Overlaps with second element
            ]
        ),
        texts=np.array(["OCR replacement"], dtype=object),
        sources=np.array([None], dtype=object),
    )

    # Call with supplement disabled
    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 253μs -> 224μs (12.8% faster)


def test_merge_with_none_values_in_texts():
    """Test handling of None values in text arrays."""
    # Create out_layout with None values
    out_layout = LayoutElements(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array([None], dtype=object),
        sources=np.array([None], dtype=object),
        element_class_ids=np.array([0], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )
    ocr_layout = TextRegions(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array(["OCR text"], dtype=object),
        sources=np.array([None], dtype=object),
    )

    # Should handle None gracefully and aggregate from OCR
    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 254μs -> 220μs (15.1% faster)


def test_merge_subregion_threshold_parameter():
    """Test that custom subregion_threshold is passed correctly to aggregation."""
    out_layout = LayoutElements(
        element_coords=np.array([[10, 20, 100, 100]]),
        texts=np.array(["Invalid (cid:123)"], dtype=object),
        sources=np.array([None], dtype=object),
        element_class_ids=np.array([0], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )
    ocr_layout = TextRegions(
        element_coords=np.array([[11, 21, 50, 50]]),  # Small region inside out_layout
        texts=np.array(["Small OCR text"], dtype=object),
        sources=np.array([None], dtype=object),
    )

    # Use very high threshold (should not match)
    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False, subregion_threshold=0.95
    )
    result = codeflash_output  # 249μs -> 230μs (8.16% faster)


def test_merge_with_supplement_false():
    """Test that supplement_with_ocr_elements=False does not add new OCR elements."""
    out_layout = LayoutElements(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array(["Layout text"], dtype=object),
        sources=np.array([None], dtype=object),
        element_class_ids=np.array([0], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )
    # OCR layout with element not covered by out_layout
    ocr_layout = TextRegions(
        element_coords=np.array([[200, 300, 400, 500]]),
        texts=np.array(["Uncovered OCR"], dtype=object),
        sources=np.array([None], dtype=object),
    )

    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 10.7μs -> 10.0μs (7.28% faster)


def test_merge_with_supplement_true_adds_uncovered_ocr():
    """Test that supplement_with_ocr_elements=True adds uncovered OCR elements."""
    out_layout = LayoutElements(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array(["Layout text"], dtype=object),
        sources=np.array([None], dtype=object),
        element_class_ids=np.array([0], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )
    # OCR layout with element not covered by out_layout
    ocr_layout = TextRegions(
        element_coords=np.array([[200, 300, 400, 500]]),
        texts=np.array(["Uncovered OCR"], dtype=object),
        sources=np.array([None], dtype=object),
    )

    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=True
    )
    result = codeflash_output  # 863μs -> 879μs (1.81% slower)


def test_merge_text_array_dtype_converted_to_object():
    """Test that text array dtype is converted to object before modification."""
    # Create out_layout with strings dtype
    out_layout = LayoutElements(
        element_coords=np.array([[10, 20, 30, 40], [50, 60, 70, 80]]),
        texts=np.array(["Valid", "Invalid (cid:1)"], dtype=object),
        sources=np.array([None, None], dtype=object),
        element_class_ids=np.array([0, 0], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )
    ocr_layout = TextRegions(
        element_coords=np.array([[50, 60, 70, 80]]),
        texts=np.array(["OCR"], dtype=object),
        sources=np.array([None], dtype=object),
    )

    # Function should handle dtype conversion gracefully
    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 252μs -> 225μs (12.0% faster)


def test_merge_with_all_invalid_text():
    """Test case where all out_layout elements have invalid text."""
    out_layout = LayoutElements(
        element_coords=np.array(
            [
                [10, 20, 30, 40],
                [50, 60, 70, 80],
                [90, 100, 110, 120],
            ]
        ),
        texts=np.array(
            [
                "Invalid (cid:1)",
                "Also (cid:2) invalid",
                "(cid:3) starts with invalid",
            ],
            dtype=object,
        ),
        sources=np.array([None, None, None], dtype=object),
        element_class_ids=np.array([0, 0, 0], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )
    ocr_layout = TextRegions(
        element_coords=np.array(
            [
                [10, 20, 30, 40],
                [50, 60, 70, 80],
                [90, 100, 110, 120],
            ]
        ),
        texts=np.array(["OCR1", "OCR2", "OCR3"], dtype=object),
        sources=np.array([None, None, None], dtype=object),
    )

    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 654μs -> 554μs (18.1% faster)
    for text in result.texts:
        pass


def test_merge_with_single_element():
    """Test merge operation with single element in layouts."""
    out_layout = LayoutElements(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array(["(cid:invalid)"], dtype=object),
        sources=np.array([None], dtype=object),
        element_class_ids=np.array([0], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )
    ocr_layout = TextRegions(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array(["Single OCR"], dtype=object),
        sources=np.array([None], dtype=object),
    )

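    # Call with supplement disabled, matching the other tests in this file
    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output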


def test_merge_with_large_cid_patterns():
    """Test handling of very long or complex (cid:) patterns."""
    out_layout = LayoutElements(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array(["Text with (cid:" + "1234567890" * 10 + ") pattern"], dtype=object),
        sources=np.array([None], dtype=object),
        element_class_ids=np.array([0], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )
    ocr_layout = TextRegions(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array(["Clean OCR text"], dtype=object),
        sources=np.array([None], dtype=object),
    )

    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 258μs -> 222μs (16.0% faster)


def test_merge_multiple_ocr_regions_for_single_layout_element():
    """Test aggregation when multiple OCR regions map to single layout element."""
    out_layout = LayoutElements(
        element_coords=np.array([[10, 20, 100, 100]]),
        texts=np.array(["(cid:invalid)"], dtype=object),
        sources=np.array([None], dtype=object),
        element_class_ids=np.array([0], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )
    # Multiple OCR regions within the layout element
    ocr_layout = TextRegions(
        element_coords=np.array(
            [
                [15, 25, 40, 40],
                [50, 60, 80, 80],
            ]
        ),
        texts=np.array(["First", "Second"], dtype=object),
        sources=np.array([None, None], dtype=object),
    )

    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 263μs -> 231μs (14.1% faster)


def test_merge_large_number_of_elements():
    """Test merge performance with large number of layout elements."""
    # Create 500 layout elements with mixed valid/invalid text
    n_elements = 500
    coords = np.array([[i * 10, i * 10, i * 10 + 20, i * 10 + 20] for i in range(n_elements)])
    texts = np.array(
        ["Valid text" if i % 2 == 0 else f"Invalid (cid:{i})" for i in range(n_elements)],
        dtype=object,
    )
    sources = np.array([None] * n_elements, dtype=object)
    class_ids = np.array([0] * n_elements, dtype=float)

    out_layout = LayoutElements(
        element_coords=coords,
        texts=texts,
        sources=sources,
        element_class_ids=class_ids,
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )

    # Create OCR layout with fewer elements
    ocr_coords = np.array(
        [[i * 10 + 1, i * 10 + 1, i * 10 + 19, i * 10 + 19] for i in range(0, n_elements, 2)]
    )  # 250 elements
    ocr_texts = np.array([f"OCR text {i}" for i in range(len(ocr_coords))], dtype=object)
    ocr_sources = np.array([None] * len(ocr_coords), dtype=object)

    ocr_layout = TextRegions(
        element_coords=ocr_coords,
        texts=ocr_texts,
        sources=ocr_sources,
    )

    # Call merge function
    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 43.8ms -> 31.0ms (41.3% faster)


def test_merge_large_ocr_layout():
    """Test merge with large OCR layout (500+ elements)."""
    # Create small out_layout
    out_layout = LayoutElements(
        element_coords=np.array([[0, 0, 1000, 1000]]),
        texts=np.array(["Invalid (cid:test)"], dtype=object),
        sources=np.array([None], dtype=object),
        element_class_ids=np.array([0], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )

    # Create large OCR layout with 600 elements
    n_ocr_elements = 600
    ocr_coords = np.array(
        [
            [i % 100 * 10, i // 100 * 100, i % 100 * 10 + 50, i // 100 * 100 + 50]
            for i in range(n_ocr_elements)
        ]
    )
    ocr_texts = np.array([f"OCR element {i}" for i in range(n_ocr_elements)], dtype=object)
    ocr_sources = np.array([None] * n_ocr_elements, dtype=object)

    ocr_layout = TextRegions(
        element_coords=ocr_coords,
        texts=ocr_texts,
        sources=ocr_sources,
    )

    # Call merge function
    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 3.31ms -> 3.04ms (8.83% faster)


def test_merge_with_supplement_large_uncovered_ocr():
    """Test supplement logic with many uncovered OCR elements."""
    # Create out_layout with only one element
    out_layout = LayoutElements(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array(["Layout"], dtype=object),
        sources=np.array([None], dtype=object),
        element_class_ids=np.array([0], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )

    # Create large OCR layout with mostly uncovered elements
    n_uncovered = 450
    ocr_coords = np.array(
        [
            [1000 + i * 10, 1000 + i * 10, 1000 + i * 10 + 20, 1000 + i * 10 + 20]
            for i in range(n_uncovered)
        ]
    )
    ocr_texts = np.array([f"Uncovered OCR {i}" for i in range(n_uncovered)], dtype=object)
    ocr_sources = np.array([None] * n_uncovered, dtype=object)

    ocr_layout = TextRegions(
        element_coords=ocr_coords,
        texts=ocr_texts,
        sources=ocr_sources,
    )

    # Call merge with supplement enabled
    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=True
    )
    result = codeflash_output  # 4.87ms -> 4.84ms (0.619% faster)


def test_merge_with_overlapping_elements():
    """Test merge with multiple overlapping layout and OCR elements."""
    # Create out_layout with overlapping elements
    n_layout = 300
    out_coords = np.array([[i * 5, i * 5, i * 5 + 50, i * 5 + 50] for i in range(n_layout)])
    out_texts = np.array(
        ["Invalid (cid:x)" if i % 3 == 0 else f"Valid {i}" for i in range(n_layout)], dtype=object
    )
    out_sources = np.array([None] * n_layout, dtype=object)
    out_class_ids = np.array([0] * n_layout, dtype=float)

    out_layout = LayoutElements(
        element_coords=out_coords,
        texts=out_texts,
        sources=out_sources,
        element_class_ids=out_class_ids,
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )

    # Create overlapping OCR layout
    n_ocr = 250
    ocr_coords = np.array([[i * 7, i * 7, i * 7 + 40, i * 7 + 40] for i in range(n_ocr)])
    ocr_texts = np.array([f"OCR {i}" for i in range(n_ocr)], dtype=object)
    ocr_sources = np.array([None] * n_ocr, dtype=object)

    ocr_layout = TextRegions(
        element_coords=ocr_coords,
        texts=ocr_texts,
        sources=ocr_sources,
    )

    # Call merge
    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 25.2ms -> 18.9ms (33.3% faster)
    # Invalid texts should be replaced or aggregated
    for i, text in enumerate(result.texts):
        pass


def test_merge_boundary_coordinates_large_scale():
    """Test merge with elements at document boundaries at large scale."""
    # Create layout elements distributed across large coordinate space
    n_elements = 400
    out_coords = np.array(
        [
            (
                [0, 0, 100, 100]
                if i == 0  # Top-left corner
                else (
                    [10000, 10000, 10100, 10100]
                    if i == 1  # Bottom-right corner
                    else [i * 50, i * 50, i * 50 + 50, i * 50 + 50]
                )
            )  # Others
            for i in range(n_elements)
        ]
    )
    out_texts = np.array(["(cid:invalid)" for _ in range(n_elements)], dtype=object)
    out_sources = np.array([None] * n_elements, dtype=object)
    out_class_ids = np.array([0] * n_elements, dtype=float)

    out_layout = LayoutElements(
        element_coords=out_coords,
        texts=out_texts,
        sources=out_sources,
        element_class_ids=out_class_ids,
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )

    # Create OCR with matching coordinates
    ocr_coords = out_coords[:300]  # Use same coordinates for subset
    ocr_texts = np.array([f"OCR text {i}" for i in range(300)], dtype=object)
    ocr_sources = np.array([None] * 300, dtype=object)

    ocr_layout = TextRegions(
        element_coords=ocr_coords,
        texts=ocr_texts,
        sources=ocr_sources,
    )

    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 87.3ms -> 61.5ms (41.8% faster)
    # No invalid patterns should remain
    for text in result.texts:
        if text:
            pass


def test_merge_with_varied_subregion_thresholds_many_elements():
    """Test merge with different subregion thresholds on large dataset."""
    n_elements = 300
    out_coords = np.array([[i * 20, i * 20, i * 20 + 100, i * 20 + 100] for i in range(n_elements)])
    out_texts = np.array(["Invalid (cid:test)" for _ in range(n_elements)], dtype=object)
    out_sources = np.array([None] * n_elements, dtype=object)
    out_class_ids = np.array([0] * n_elements, dtype=float)

    out_layout = LayoutElements(
        element_coords=out_coords,
        texts=out_texts,
        sources=out_sources,
        element_class_ids=out_class_ids,
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )

    # OCR layout with small regions inside each out_layout element
    ocr_coords = np.array(
        [[i * 20 + 10, i * 20 + 10, i * 20 + 40, i * 20 + 40] for i in range(n_elements)]
    )
    ocr_texts = np.array([f"Small OCR {i}" for i in range(n_elements)], dtype=object)
    ocr_sources = np.array([None] * n_elements, dtype=object)

    ocr_layout = TextRegions(
        element_coords=ocr_coords,
        texts=ocr_texts,
        sources=ocr_sources,
    )

    # Test with different thresholds
    for threshold in [0.1, 0.5, 0.9]:
        codeflash_output = merge_out_layout_with_ocr_layout(
            out_layout,
            ocr_layout,
            supplement_with_ocr_elements=False,
            subregion_threshold=threshold,
        )
        result = codeflash_output  # 78.6ms -> 57.6ms (36.4% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run git checkout codeflash/optimize-merge_out_layout_with_ocr_layout-mkrn264u and push.

codeflash-ai bot and others added 3 commits January 24, 2026 01:36
The optimized code achieves a **30% speedup** through three changes in `aggregate_embedded_text_by_block` and `supplement_layout_with_ocr_elements`:

## Key Optimizations

### 1. **Replaced `.sum(axis=1).astype(bool)` with `.any(axis=1)`**
This change appears in both functions when computing boolean masks from the result of `bboxes1_is_almost_subregion_of_bboxes2()`:

**Why it's faster:**
- `.sum(axis=1)` creates an intermediate integer array by counting True values across columns, then converts to boolean
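- `.any(axis=1)` reduces each row with a logical OR and returns a boolean array directly, so no integer counts are accumulated; whether NumPy stops at the first True value in a row is an implementation detail, not a guarantee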
- Eliminates the explicit `.astype(bool)` conversion overhead

**Performance impact:** According to the line profiler, the mask computation in `aggregate_embedded_text_by_block` dropped from ~234ms to ~222ms (about 5% less time), and the function as a whole improved from 551ms to 443ms (19.6% less time).
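
A minimal sketch of why the two mask expressions are interchangeable. The small boolean matrix below is made up for illustration; it stands in for the result of `bboxes1_is_almost_subregion_of_bboxes2()`, which is assumed here to be a 2-D boolean array.

```python
import numpy as np

# Hypothetical stand-in for the subregion matrix: one row per candidate
# region, one column per target box (assumed layout, illustration only).
is_subregion = np.array(
    [
        [False, False, True],
        [False, False, False],
        [True, True, False],
    ]
)

# Original form: count True values per row, then cast the counts to bool.
mask_via_sum = is_subregion.sum(axis=1).astype(bool)

# Optimized form: logical-OR reduction that yields a boolean array directly.
mask_via_any = is_subregion.any(axis=1)

print(mask_via_sum)
print(mask_via_any)
```

Both expressions select the same rows, so the swap changes only how the mask is computed, not which regions it keeps.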

### 2. **Avoided redundant slicing operations**
In `aggregate_embedded_text_by_block`, the optimized code stores `sliced = source_regions.slice(mask)` once and reuses it, instead of calling `source_regions.slice(mask)` three separate times:

**Why it's faster:**
- Each `slice()` operation creates a new object with coordinate and text array copies
- Line profiler shows the original made 3 separate slice calls (48ms + 25ms + 34ms = 107ms total)
- The optimized version makes 1 slice call (~28ms), saving ~79ms in total across the profiled calls
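
The slice-once pattern can be sketched as follows. `ToyRegions` is a hypothetical container written for this example, not the library's `TextRegions` class; it only assumes that `slice(mask)` returns a new object holding boolean-indexed copies of the arrays.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ToyRegions:
    """Hypothetical stand-in for a region container; not the library class."""

    element_coords: np.ndarray
    texts: np.ndarray

    def slice(self, mask: np.ndarray) -> "ToyRegions":
        # Boolean indexing copies both arrays, so every call allocates anew.
        return ToyRegions(self.element_coords[mask], self.texts[mask])


regions = ToyRegions(
    element_coords=np.array([[0, 0, 10, 10], [20, 20, 30, 30], [40, 40, 50, 50]]),
    texts=np.array(["alpha", "beta", "gamma"], dtype=object),
)
mask = np.array([True, False, True])

# Repeated pattern: each use re-slices and pays for the copies again.
text_repeated = " ".join(regions.slice(mask).texts)
coords_repeated = regions.slice(mask).element_coords

# Reuse pattern: slice once, then read both attributes from that one result.
sliced = regions.slice(mask)
text_reused = " ".join(sliced.texts)
coords_reused = sliced.element_coords
```

The results are identical either way; the reuse pattern simply avoids building the same sliced object more than once.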

### 3. **Early exit with `mask.any()`**
The optimized code checks `if mask.any():` before processing, avoiding unnecessary work when no regions match:

**Why it's faster:**
- Skips text joining, bbox extraction, and IOU calculations when mask is empty
- Particularly beneficial for the 368 cases (31% of calls) where no matching regions exist
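
A small sketch of the early-exit guard. `join_matching_texts` is a hypothetical helper invented for this example; the real function also extracts bounding boxes and computes IOU, which are left out here.

```python
import numpy as np


def join_matching_texts(texts: np.ndarray, mask: np.ndarray) -> str:
    """Hypothetical helper: join the texts selected by `mask`, or return ""."""
    if not mask.any():
        # Nothing matched: skip the indexing and joining work entirely.
        return ""
    return " ".join(texts[mask])


texts = np.array(["alpha", "beta", "gamma"], dtype=object)

matched = join_matching_texts(texts, np.array([False, True, True]))
unmatched = join_matching_texts(texts, np.array([False, False, False]))

print(repr(matched))
print(repr(unmatched))
```

The guard costs one cheap reduction over the mask, which pays off whenever a meaningful share of calls have no matching regions.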

## Impact Based on Test Results

The optimization is particularly effective for workloads with:

1. **Many elements requiring text aggregation** (10-41% speedup on tests with 100-500 elements)
   - `test_large_scale_many_elements_aggregated`: 77ms → 67.2ms (14.6% faster)
   - `test_merge_large_number_of_elements`: 43.8ms → 31.0ms (41.3% faster)
   - `test_merge_boundary_coordinates_large_scale`: 87.3ms → 61.5ms (41.8% faster)

2. **Documents with invalid text patterns** (10-20% speedup)
   - `test_invalid_texts_are_replaced`: 250μs → 222μs (12.4% faster)
   - `test_merge_with_all_invalid_text`: 654μs → 554μs (18.1% faster)

3. **Complex spatial matching scenarios** (33-36% speedup)
   - `test_merge_with_overlapping_elements`: 25.2ms → 18.9ms (33.3% faster)
   - `test_merge_with_varied_subregion_thresholds_many_elements`: 78.6ms → 57.6ms (36.4% faster)
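
The "% faster" figures for the tests above appear to follow the convention `before / after - 1`. This is inferred from the numbers rather than stated in the report, so treat the helper below as an assumption. Because the published timings are rounded, recomputed values can differ from the quoted ones in the last digit.

```python
def speedup_percent(before: float, after: float) -> float:
    """Relative speedup in percent, assuming the convention before/after - 1."""
    return (before / after - 1.0) * 100.0


# Timings in milliseconds, taken from the test results listed above.
overlapping = speedup_percent(25.2, 18.9)
large_number = speedup_percent(43.8, 31.0)

print(f"{overlapping:.1f}% faster")
print(f"{large_number:.1f}% faster")
```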

## Context Impact

The function `merge_out_layout_with_ocr_layout` is called from `supplement_page_layout_with_ocr` in OCR processing hot paths, specifically when `ocr_mode == OCRMode.FULL_PAGE`. Each page processed invokes this function once, making the 30% speedup directly translate to faster document processing throughput for PDF/image partitioning workflows.